Fujitsu Laboratories Trec9 Report 1 System Description 2 Common Processing 2.1 Indexing/query Processing 2.1.1 Indexing Vocabulary 2.1.2 Stemmer 2.1.4 Stop Word List for Query Processing

نویسنده

  • Isao Namba
چکیده

This year a Fujitsu Laboratory team participated in web tracks. For TREC9 we experimented passage retrieval which is expected to be e ective for Web pages which contain more than one topic. To split document into passages, we used NLP based paragrah detecting program, not by xed (variable) window size. But it did not produce better result for TREC9 Web data. For indexing large web data faster, we developped two techiniques. One is multi-partional selective sorting for inversion which is about 10-30% faster than normal quick sorting in sorting term-number, text-number pair. The other is compressed trie dictionary based stemming. 1 System Description Except reranking by passage retrieval, and passage segementing program for index preprocessing, the frame work we used, is same as that of TREC8[1].

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fujitsu Laboratories Trec8 Report 1 System Description 1.0.1 Tera 2 Common Processing

This year a Fujitsu Laboratory team participated in three tracks:that is ad hoc, small web track, and large web track. As basic techiniques, we compared four popular stemmers, and we made simple removing stop pattern techniques for TREC queries. For the ad hoc task, and small web track, we used the same techiniques. We experimented with area weighting, co-occurence boosting, bi-gram utlization,...

متن کامل

Fujitsu Laboratories TREC8 Report - Ad hoc, Small Web, and Large Web Track

This year a Fujitsu Laboratory team participated in three tracks:that is ad hoc, small web track, and large web track. As basic techiniques, we compared four popular stemmers, and we made simple removing stop pattern techniques for TREC queries. For the ad hoc task, and small web track, we used the same techiniques. We experimented with area weighting, co-occurence boosting, bi-gram utlization,...

متن کامل

Lucene for n-grams using the CLUEWeb Collection

The ARSC team made modifications to the Apache Lucene engine to accommodate " go words, " taken from the Google Gigaword vocabulary of n‐grams. Indexing the Category " B " subset of the ClueWeb collection was accomplished by a divide and conquer method, working across the separate ClueWeb subsets for 1, 2 and 3‐grams. Phrase searching—or imposing an order on query terms—has traditionally been a...

متن کامل

TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search

For the TREC-2009 Chemical IR Track, we explore development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon W...

متن کامل

Hybrid Term Indexing: an Evaluation

Retrieval effectiveness depends on how terms are extracted and indexed. For Chinese text (and others like Japanese and Korean), there are no space to delimit words. Indexing using hybrid terms (i.e. words and bigrams) were able to achieve the best precision amongst homogenous terms at a lower storage cost than indexing with bigrams. However, this was tested with conjunctive queries, using a sma...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999